Using ISU-GAN for unsupervised small sample defect detection

Surface defect detection is a vital process in industrial production and a significant research direction in computer vision. Although today’s deep learning defect detection methods based on computer vision can achieve high detection accuracy, they are mainly based on supervised learning. They require many defect samples to train the model, which is at odds with the reality that industrial defect samples are difficult to obtain and costly to label. We therefore propose a new unsupervised small-sample defect detection model, ISU-GAN, which is based on the CycleGAN architecture. A skip connection, an SE module, and an Involution module are added to the Generator, significantly improving the feature extraction capability of the model. Moreover, we propose an SSIM-based defect segmentation method that applies to GAN-based defect detection and can accurately extract defect contours without redundant noise-reduction post-processing. Experiments on the DAGM2007 dataset show that the unsupervised ISU-GAN, using less than 1/3 of the training data and no labels, achieves higher detection accuracy and finer defect profiles than supervised models trained on the full training set. Relative to the supervised segmentation models UNet and ResUNet++, which use more training samples, our model improves the detection accuracy by 2.84% and 0.41%, respectively, and the F1 score by 0.025 and 0.0012, respectively. In addition, the defect profile predicted by our method is closer to the real profile than those of the other models used for comparison.

Products may have surface defects in the actual industrial production process due to machine errors, worker errors, and production process problems. Surface defects not only affect the aesthetics and performance of the product, resulting in lower user satisfaction, but may also be a safety hazard, posing a threat to the life and property of the user. Hence, surface defect detection is an essential part of industrial production.
For a long time, industrial surface defect detection has relied on manual work, which is not only time-consuming and laborious but also very subjective, and cannot meet the needs of industrial production for high efficiency and precision. Therefore, automated defect detection technology based on computer vision has become a popular research direction. Currently, automated defect detection methods based on machine vision mainly include traditional methods and deep learning methods.
Traditional methods. Traditional methods rely on the structural information of the image to detect defects. They usually require human effort to design the corresponding detection algorithm based on the characteristics of the defect and the actual application scenario. Current traditional defect detection methods based on machine vision mainly include Gabor filtering 1 , the improved local binary pattern algorithm (MB-LBP) 2 , the improved Sobel algorithm 3 , etc. Most traditional vision methods rely heavily on specific defect features and make end-to-end detection difficult. Designing different inspection algorithms by hand for different defects is very costly in time and money and requires many people with strong expertise, which makes it hard to meet the efficiency and cost requirements of industrial production. Furthermore, in practice, detection algorithms based on the characteristics of defects seen by the human eye are susceptible to interference from changes in the external environment, making it difficult to achieve satisfactory robustness.

Deep learning methods. With the advent of deep learning, various algorithms based on Convolutional Neural Networks (CNNs) have achieved surprising results in many subfields of machine vision. Compared to traditional defect detection methods, deep learning methods mostly eliminate the need to design detection algorithms by hand, and many novel models achieve good detection results on specific datasets. For example, Lee et al. 12 proposed a real-time decision-making method for steel surface defect detection based on CNN and class activation maps. Mei et al. 13 used Denoising Autoencoder Networks with Gaussian pyramids to reconstruct defects, combined with multi-scale fusion, to detect surface defects in fabrics with good results. Zhong et al. 14 proposed PVANET++ based on Faster R-CNN, which associates the low-level feature map with the high-level feature map to form a new super-expression map for proposal extraction, and applied it to detecting defects in railway cotter pins. Tabernik et al. 15 designed a two-stage detection model based on a segmentation network and a discriminative network, which extracted fine defect profiles on the KolektorSDD dataset. Huang et al. 16 proposed an improved MCue module with UNet to generate saliency images for detecting magnetic tile surface defects. Li et al. 17 proposed an improved UNet with a Dense Block module and summation skip connection to detect concrete surface cracks; the method achieved an average pixel accuracy of 91.59% and an average IoU of 84.53% on the concrete defect dataset. Inspired by UNet and DenseNet, the DefectSegNet proposed by Roberts et al. 18 adopts skip connections within and between blocks and shows high pixel accuracy on a high-quality steel defect dataset.
Current surface defect detection models based on general deep learning can achieve high detection accuracy and real-time requirements, but they mostly require a large number of negative samples and labels for training, which is costly and difficult to implement in industrial applications.

GAN-based defect detection.
Using GANs for surface defect detection is a relatively novel approach, first seen in AnoGAN 19 , proposed by Schlegl et al. in 2017. AnoGAN learns the manifold distribution of positive samples in the latent space during the training phase; in the testing phase, it iteratively searches this space for the nearest vector and then compares the generator output with the original image to find the anomalous region. As this iterative optimization was too time-consuming, the authors proposed an improved version with an encoder structure, f-AnoGAN 20 , in 2019, which alleviates the time consumption to a certain extent. Other similarly improved versions include those of Zenati et al. 21 and Akcay et al. 22 . Niu et al. 23 used the original CycleGAN to repair and detect defects, but their approach requires many more training samples and struggles to achieve stable detection performance when the defect background is complex.
In response to the difficulty of obtaining defect samples in industrial applications, Di et al. 24 combined a convolutional autoencoder (CAE) and a semi-supervised generative adversarial network (SGAN) to propose the semi-supervised CAE-SGAN, which obtains better detection results with fewer training images of hot-rolled sheets. He et al. 25 proposed a fusion algorithm based on cDCGAN and ResNet to generate pseudo-labels for unlabelled samples and used it to train a defect detection model, which achieved good results on the NEU-CLS dataset. Zhao et al. 26 proposed a positive sample-based detection method which used a Defect Generation Module to create defects for the positive samples and then trained a DCGAN to repair the defects. However, generating defects that are close to the true distribution is itself a difficult problem. Although current GAN-based defect detection methods can be semi-supervised or unsupervised, they still only perform well on simple, uniformly textured surfaces. GAN networks that can be applied to complex industrial inspection environments need further research.
www.nature.com/scientificreports/
Our method. To address the common problems of high annotation cost and difficulty in obtaining training data for deep learning defect detection, we designed an unsupervised ISU-GAN model and an SSIM-based defect extraction method. ISU is an abbreviation of Involution-SE-U, which means a U-shaped structured network using the Involution operator and SE operator. ISU-GAN is essentially an improved version of CycleGAN. The differences from the original CycleGAN network structure include: 1. The generator adopts a UNet-like structure to reduce the possible loss of defective features during the encoding-decoding process of the input image; 2. the SE operator is used for the feature maps of the critical layers to suppress the less important channels; 3. the Involution operator is used for the feature maps obtained by downsampling to meet the demand for different visual capabilities of defective and non-defective regions.
In the training phase, we want to obtain generators that map positive samples (defect-free samples) and negative samples (defective samples) to each other. The defect repair network maps negative samples to positive samples, and the defect manufacturing network maps positive samples to negative samples. In the testing phase, we input the test image into the defect repair network. We then use the Structural Similarity Algorithm (SSIM) 27 to compare the original image and the repaired image to obtain an SSIM score map with the same resolution as the original image. Finally, we use the OTSU algorithm 28 to extract the contours of the defects adaptively.
Our method achieves an average accuracy of 98.43% and an F1 score of 0.9792 on the DAGM2007 dataset using only a small number of training samples. It can segment very accurate defect profiles. We also validate the superiority of our ISU-GAN network structure over other commonly used defect detection models and the effectiveness of its main modules through comparative and ablation experiments.
In general, the innovation of our work mainly includes the following two aspects.
Defect detection network. We propose a new GAN defect detection network, ISU-GAN, converging quickly and achieving excellent detection accuracy with a small training dataset.
Defect segmentation method. We propose an SSIM-based defect segmentation method that applies to GAN-based defect detection. Without requiring labels, our method can accurately extract defect contours without redundant noise-reduction post-processing.

Methodology
In this section, we describe the principle of the defect detection method proposed in this paper and the model structure of ISU-GAN. In the training phase, we train ISU-GAN to learn the mapping relationship between negative and positive samples. ISU-GAN is based on the CycleGAN architecture and consists of two cooperating GANs, as shown in Fig. 1. The solid orange line indicates $GAN_P$ and the solid blue line indicates $GAN_N$, which are the GANs for repairing defects and generating defects, respectively. The first adversarial network $GAN_P$ consists of a generator $G_{n2p}$ and a discriminator $D_p$. The input to $G_{n2p}$ is the negative sample set $N$ of the training dataset; $G_{n2p}$ repairs the defective image regions in $N$ and generates pseudo-positive samples $\tilde{P}$ that do not contain defects. The input to the discriminator $D_p$ is the true positive sample set $P$ and the pseudo-positive samples $\tilde{P}$, and its role is to distinguish $P$ from $\tilde{P}$. Correspondingly, the other adversarial network $GAN_N$ consists of a generator $G_{p2n}$ and a discriminator $D_n$. The input to $G_{p2n}$ is the positive sample set $P$ of the training dataset; $G_{p2n}$ adds defects to the images in $P$ and generates pseudo-negative samples $\tilde{N}$ that contain defects. The input to the discriminator $D_n$ is the true negative sample set $N$ and the pseudo-negative samples $\tilde{N}$, and its role is to distinguish $N$ from $\tilde{N}$.
Based on the cycle consistency criterion of CycleGAN, $\tilde{P}$ is additionally input into $G_{p2n}$ to generate secondary pseudo-negative samples $\hat{N}$. We expect $N$ and $\hat{N}$ to be as similar as possible, i.e. $n \approx G_{p2n}(G_{n2p}(n)),\ n \in N$. Correspondingly, $\tilde{N}$ is input into $G_{n2p}$ to generate secondary pseudo-positive samples $\hat{P}$, with $p \approx G_{n2p}(G_{p2n}(p)),\ p \in P$.
In the test phase, the test dataset $X$ (containing positive and negative samples) is fed into the defect repair generator $G_{n2p}$ obtained from training. For any sample $x \in X$, the SSIM algorithm is used to compare $x$ and $G_{n2p}(x)$ to obtain an SSIM score map with the same resolution as $x$ (a higher score means that the region is more similar). The OTSU adaptive threshold segmentation algorithm is then used to segment the SSIM score map to determine whether there are defects in $x$ and to extract the possible defect contours.
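The thresholding step of the test phase can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: it assumes the SSIM score map has already been computed, treats low-scoring pixels as defect candidates, and uses a hypothetical `min_area` parameter for the image-level decision, since the text does not state how that decision is derived from the segmentation.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the threshold that maximises the between-class variance (Otsu)."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(hist).astype(float)        # pixel count at or below each bin
    w1 = w0[-1] - w0                          # pixel count above each bin
    cum_sum = np.cumsum(hist * centers)
    mu0 = cum_sum / np.maximum(w0, 1e-12)     # mean of the lower class
    mu1 = (cum_sum[-1] - cum_sum) / np.maximum(w1, 1e-12)  # mean of the upper class
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

def segment_defects(score_map, min_area=20):
    """Split an SSIM score map into defect / background pixels.

    min_area is a hypothetical parameter (not from the paper): the image is
    flagged as defective only if at least this many pixels fall below the
    Otsu threshold.
    """
    threshold = otsu_threshold(score_map)
    mask = score_map <= threshold             # low similarity -> defect candidate
    has_defect = bool(mask.sum() >= min_area)
    return mask, has_defect
```

On real score maps the histogram is rarely this clean, so the image-level criterion would need to be tuned or replaced; the sketch only shows where the adaptive threshold enters the pipeline.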
Network structure. Generator. The Generator follows the Encoder-Decoder design and has a general structure similar to UNet, as shown in Fig. 2. After the image is input to the Generator, it is first downsampled by three 3 × 3 convolutional layers to obtain a 256-channel feature map, which is then passed through the SE module to rescale the channels of the feature map by importance. Its purpose is to take full advantage of the channel-independent properties of the next Involution module to focus on the more critical channels. Nine consecutive residual blocks follow the Involution layer to improve the convergence of the model. Further on are the symmetrically designed Involution and SE modules, and an upsampling layer implemented by three 4 × 4 transposed convolutions. In particular, to reduce feature loss from the downsampling-upsampling operation, we use a skip connection to aggregate information from the shallow and deep feature maps. Specifically, we filter the 64-channel and 256-channel feature maps from the downsampling operation with the SE module, then concatenate them with the feature maps that have the same number of channels from the upsampling operation, and use a 3 × 3 convolutional layer to restore the channel count to its original value.
In the Generator structure, all convolutional layers except at ⋆ carry Instance Norm and ReLU.
Discriminator. The Discriminator uses the PatchGAN structure 30 , containing only four shallow 4 × 4 convolutional layers. The input image is first transformed into a 512-channel feature map by passing through three convolutional layers whose number of filters increases multiplicatively from layer to layer, and is then reduced to a single-channel feature map X by a convolutional layer with one filter. Each pixel of X represents the discriminator's score for the corresponding region of the input image. Compared to conventional discriminators, the PatchGAN Discriminator scores each patch of the input image separately, which enables the extraction of local image features and helps improve the detail quality of the generated image.
In the Discriminator structure, all convolutional layers come with Instance Norm and LeakyReLU with slope 0.2. LeakyReLU is used instead of ReLU to alleviate the gradient vanishing problem during training.

Skip connection.
To reduce the loss of image detail features due to the downsampling-upsampling process, we performed a skip connection between the 64-channel and 128-channel intermediate feature maps, see Fig. 2. The skip connection in ISU-GAN connects the shallow feature map to the deep feature map in the channel dimension (using a Reflection pad to match the resolution if the two feature maps have different resolutions). A 3 × 3 convolution is then used to restore the feature map, which now has double the number of channels, to the original number of channels. In contrast to a conventional skip connection, the shallow feature map is rescaled for channel importance by the SE Block before the channel concatenation. The benefit of adding the SE module to the skip connection is that it provides a better aggregation of the essential features of the shallow feature maps, allowing the model to extract defect profiles more effectively.
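The data flow of this skip connection can be sketched in NumPy as below. It is an illustrative sketch under several assumptions: the SE channel weights are passed in precomputed, the shallow map is assumed to be the smaller one and is reflection-padded on the bottom and right, and the normalisation and activation that follow the convolution in the Generator are omitted. The function names and weight shapes are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, weight):
    """3x3 convolution, stride 1, reflection padding.

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3). Bias is omitted.
    """
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="reflect")
    patches = sliding_window_view(xp, (3, 3), axis=(1, 2))   # (C_in, H, W, 3, 3)
    return np.einsum("chwuv,ocuv->ohw", patches, weight)

def se_skip_connection(shallow, deep, channel_weights, fuse_weight):
    """Rescale the shallow map, align it, concatenate, and fuse the channels.

    shallow, deep: (C, H, W) maps with the same channel count C.
    channel_weights: (C,) SE importance weights for the shallow map.
    fuse_weight: (C, 2C, 3, 3) kernel restoring the channel count to C.
    """
    shallow = channel_weights[:, None, None] * shallow       # SE rescaling
    dh = deep.shape[1] - shallow.shape[1]                    # assumed >= 0
    dw = deep.shape[2] - shallow.shape[2]                    # assumed >= 0
    shallow = np.pad(
        shallow,
        ((0, 0), (dh // 2, dh - dh // 2), (dw // 2, dw - dw // 2)),
        mode="reflect",
    )
    merged = np.concatenate([shallow, deep], axis=0)         # (2C, H, W)
    return conv3x3(merged, fuse_weight)                      # back to (C, H, W)
```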
Squeeze-and-excitation block. The Squeeze-and-excitation (SE) block is a module proposed in Ref. 31 that learns the relationship between individual feature channels to obtain the weight of each channel, thus rescaling the importance of all channels. It allows the model to focus more on channels with important information and to suppress unimportant ones. The flow chart of the SE Block is shown in Fig. 3.
Squeeze. The Squeeze operation performs feature squeezing on each channel of the feature map, converting the two-dimensional map into a real number that aggregates all the features on the channel. In this case, global average pooling is used to implement the squeeze operation, as in Eq. (1), where $x_c$ is the $c$-th channel of an input feature map of spatial size $h \times w$.
$$y_c = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} x_c(i, j) \qquad (1)$$
Excitation. The Excitation operation aims to learn the interrelationships between the different channels of the feature map and to evaluate each channel's importance. It is implemented by two successive 1 × 1 convolutions $W_1$ and $W_2$ with $c/\alpha$ and $c$ filters, respectively, where $\alpha$ is the channel reduction factor used to reduce the number of network parameters. After the two convolutions and the ReLU activation, the $c \times 1 \times 1$ vector representing the importance of each channel is mapped to values between 0 and 1 using the Sigmoid function. The process is as in Eq. (2).
$$z = \mathrm{Sigmoid}\left(W_2\, \mathrm{ReLU}\left(W_1 y\right)\right) \qquad (2)$$
Finally, the learned channel importance vector $z$ is multiplied by the original feature map $x$ to obtain the rescaled feature map $\tilde{x}$, i.e. $\tilde{x} = z \cdot x$. The SE Block is applied four times in our Generator network (as shown in the red part of Fig. 1), twice before the skip connections and twice in the middle layer of 256 channels.
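The squeeze, excitation, and rescaling steps can be illustrated with a few lines of NumPy. This is a sketch, not the authors' code: a 1 × 1 convolution applied to a c × 1 × 1 vector is written as a matrix product, biases are omitted, and the argument names `w1` and `w2` are chosen here for illustration.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation on a single feature map.

    x:  (C, H, W) feature map.
    w1: (C // alpha, C) weights of the first 1x1 convolution (bias omitted).
    w2: (C, C // alpha) weights of the second 1x1 convolution (bias omitted).
    Returns the rescaled feature map and the channel importance vector z.
    """
    y = x.mean(axis=(1, 2))                       # squeeze: global average pooling
    hidden = np.maximum(w1 @ y, 0.0)              # first 1x1 convolution + ReLU
    z = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # second 1x1 convolution + Sigmoid
    return z[:, None, None] * x, z                # rescale every channel by z
```

With all-zero `w2`, every channel receives the weight 0.5, which is a quick sanity check that the rescaling acts per channel and leaves the spatial layout untouched.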
Involution block. The traditional convolution operator has two main properties: space-independence and channel-specificity. While space-independence keeps convolution efficient, it deprives the convolution kernel of the ability to adapt to different patterns in different regions. In addition, the problem of channel redundancy within the convolution has not been solved even in many well-known CNN networks.
At the recent CVPR2021, the Involution module 32 was proposed to address this problem. The involution operator, which has space-specificity and channel-independence in contrast to convolution, uses the kernel generation function φ to generate different convolution kernels for different location regions of an image. The Involution operator gives the network different visual patterns based on different spatial locations.
The shape of the Involution kernel $H$ depends on the size of the input feature map $x$, and the kernel generation function $\phi$ generates $H$ from the pixel at each location, as in Eq. (3),
$$H_{i,j} = \phi(x_{i,j}) = W_2\, \sigma(W_1 x_{i,j}) \qquad (3)$$
where $W_1$ and $W_2$ represent linear transformations and $\sigma$ denotes BN and ReLU. $W_1$ reduces the $c \times 1 \times 1$ representation of the location-specific pixel to $\frac{c}{r} \times 1 \times 1$ ($r$ represents the reduction ratio), which $W_2$ then changes to $G \times k \times k$. $G$ is the number of groups; all channels within a group share the parameters of the kernel $H$, and the number of channels per group is typically set to 16. Finally, the generated kernel $H$ performs a single-step convolution operation on the $k \times k$ region around the specific pixel.
For surface defect detection, the use of the Involution module meets the need for different visual capabilities in different areas of the image (defective and non-defective regions), allowing the model to extract more realistic defect contours.
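To make the kernel generation and the location-specific multiply-add concrete, the following NumPy sketch implements a simplified involution operator. It is an illustration under stated assumptions: the BN part of σ is omitted (only ReLU is applied), zero padding and stride 1 are assumed, and the argument names and weight shapes are chosen here rather than taken from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def involution(x, w1, w2, groups, k=3):
    """Simplified involution (stride 1, zero padding, no BN).

    x:  (C, H, W) feature map, with C divisible by groups.
    w1: (C // r, C) channel-reducing linear map.
    w2: (groups * k * k, C // r) kernel-generating linear map.
    """
    c, h, w = x.shape
    # Kernel generation: one (groups, k, k) kernel per spatial location.
    hidden = np.maximum(np.einsum("mc,chw->mhw", w1, x), 0.0)
    kernels = np.einsum("nm,mhw->nhw", w2, hidden).reshape(groups, k, k, h, w)
    # Gather the k x k neighbourhood of every pixel.
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    patches = sliding_window_view(xp, (k, k), axis=(1, 2))    # (C, H, W, k, k)
    patches = patches.reshape(groups, c // groups, h, w, k, k)
    # Multiply-add: channels of a group share that group's kernel,
    # while the kernel itself differs from pixel to pixel.
    out = np.einsum("gdhwuv,guvhw->gdhw", patches, kernels)
    return out.reshape(c, h, w)
```

The contrast with convolution is visible in the shapes: the kernel tensor carries the spatial indices `h, w` (space-specific) but only a group index instead of a full channel index (shared across channels within a group).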

Structural similarity. Structural similarity (SSIM) is an algorithm that measures the similarity of two images, taking into account their brightness, contrast, and structural characteristics. SSIM measures these differences through the luminance comparison function l(x, y), the contrast comparison function c(x, y) and the structural comparison function s(x, y), respectively:
$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \qquad (4)$$
$$c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} \qquad (5)$$
$$s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3} \qquad (6)$$
where $\mu_x$, $\sigma_x^2$, and $\sigma_{xy}$ denote the mean of x, the variance of x, and the covariance of x and y, respectively, and $C_1$, $C_2$, and $C_3$ are small constants that stabilise the division. To simplify the form, let $C_3 = C_2/2$. The SSIM index is then expressed as Eq. (7).
$$\mathrm{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \qquad (7)$$
It is better to compute the SSIM index locally than globally in image quality assessment. Thus the mean, variance, and covariance in the above equations are calculated in the local area within a sliding window. The final global SSIM score is the average of the scores of all the local regions within the sliding window. The size of the SSIM window is a hyperparameter; through experimental comparison, we set it to 9. The SSIM algorithm can be used not only to measure the similarity of two images but also as a loss measure during model training, called SSIM loss. SSIM loss has the advantage of fast training convergence, so this paper uses SSIM loss in the pre-training phase to reduce the required training time.
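A per-pixel SSIM score map with a 9 × 9 sliding window can be sketched in NumPy as follows. The sketch makes assumptions that the text does not fix: a uniform (box) window rather than a Gaussian one, reflection padding so that the map keeps the input resolution, images scaled to [0, 1], and the commonly used constants C1 = (0.01·L)² and C2 = (0.03·L)² with L the data range.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(x, y, win=9, data_range=1.0, k1=0.01, k2=0.03):
    """Local SSIM score for every pixel of two greyscale images of equal shape."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    pad = win // 2

    def local_mean(img):
        # Mean over the win x win neighbourhood of each pixel.
        padded = np.pad(img, pad, mode="reflect")
        return sliding_window_view(padded, (win, win)).mean(axis=(-2, -1))

    x = x.astype(float)
    y = y.astype(float)
    mu_x, mu_y = local_mean(x), local_mean(y)
    var_x = local_mean(x * x) - mu_x ** 2
    var_y = local_mean(y * y) - mu_y ** 2
    cov_xy = local_mean(x * y) - mu_x * mu_y
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Averaging the returned map gives a global SSIM score, and one minus that average is a common form of SSIM loss; whether the paper uses exactly this form is not stated.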

Loss function. In ISU-GAN, we use three types of loss function: Adversarial Loss L GAN , Cycle Consistency Loss L cycle and Identity Loss L identity .
Adversarial loss. L GAN is divided into L GAN_G and L GAN_D in terms of specific implementations, which represent the optimization targets of the generator G and the discriminator D, respectively. The adversarial loss is measured using L2 loss, as shown in Eqs. (8) and (9), where 0 and 1 represent the all-zeros tensor and the all-ones tensor, respectively. G wants the generated fake samples to deceive D, i.e. a fake input sample should make the discriminator output as close to 1 as possible. In contrast, D wants to distinguish between real and fake samples as well as possible. Thus, when the input is a real sample, D wants its output to be as close to 1 as possible, while for a fake sample it wants the output to be as close to 0 as possible.
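The L2 adversarial objectives described above can be sketched as follows for a single generator-discriminator pair. The exact weighting of the terms in the paper's equations is not recoverable from the text, so the plain sum used here is an assumption, as are the function names.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of equal shape."""
    return float(np.mean((a - b) ** 2))

def generator_adv_loss(d_fake):
    """L2 adversarial loss for G: push D's score on fake samples towards 1."""
    return mse(d_fake, np.ones_like(d_fake))

def discriminator_adv_loss(d_real, d_fake):
    """L2 adversarial loss for D: real samples towards 1, fake samples towards 0."""
    return mse(d_real, np.ones_like(d_real)) + mse(d_fake, np.zeros_like(d_fake))
```

Here `d_real` and `d_fake` stand for the single-channel PatchGAN score maps produced for real and generated images; in ISU-GAN the same pair of losses would be evaluated once for each of the two GANs.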
Cycle consistency loss. We want the samples obtained from the real samples after sequentially going through a forward mapping and a reverse mapping to be as consistent as possible with the original samples to improve the stability of the generated model, i.e. G n2p (G p2n (p)) ≈ p and G p2n (G n2p (n)) ≈ n . We use the Cycle Consistency Loss L cycle to measure this similarity. In particular, to combine the advantages of fast convergence of SSIM loss and high detail fidelity of L1 loss, we use a loss function replacement strategy for L cycle . We first train k epochs using SSIM loss to allow accelerated convergence, and then replace it with L1 loss to optimize the detail of the generated images, as shown in Eq. (10), where we empirically set k to 10.
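The loss replacement strategy can be sketched as a function of the current epoch. This is a simplified illustration, not the paper's formulation: SSIM is computed over the whole image instead of a sliding window, SSIM loss is taken as one minus SSIM, epochs are counted from zero, and the two cycle directions are summed with equal weight.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed over the whole image (a simplification of windowed SSIM)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def reconstruction_loss(x, x_rec, epoch, k=10):
    """SSIM loss for the first k epochs (0-based), L1 loss afterwards."""
    if epoch < k:
        return float(1.0 - global_ssim(x, x_rec))
    return float(np.mean(np.abs(x - x_rec)))

def cycle_loss(p, p_cyc, n, n_cyc, epoch, k=10):
    """Cycle consistency over both directions: p vs. G_n2p(G_p2n(p)), n vs. G_p2n(G_n2p(n))."""
    return reconstruction_loss(p, p_cyc, epoch, k) + reconstruction_loss(n, n_cyc, epoch, k)
```

Because the identity loss is described as following the same replacement strategy, `reconstruction_loss` could be reused there with a positive sample and its repaired version as the two arguments.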

Identity loss.
To reduce the probability of predicting a positive sample as a negative sample, we want the defect repair generator G n2p not to change the positive sample too much. To avoid unnecessary interference noise, we expect p to be as similar as possible to G n2p (p) . We use the Identity Loss L identity to measure this degree of dissimilarity. L identity uses the same loss function replacement strategy as L cycle , as shown in Eq. (11).

Experiment
Dataset. DAGM2007 33 is a well-known dataset for weakly supervised industrial defect detection, which contains ten classes of artificially produced texture defects. The dataset can be downloaded from https://hci.iwr.uni-heidelberg.de/node/3616. Each class is divided into a training set and a test set. All images in DAGM are grey-scale images of 512 × 512, where the defect images are labeled with weak supervision. We selected three representative classes (as in Table 1) for our experiments. Class 1 has a more diverse surface texture, Class 6 has a messier surface texture, and Class 7 has sliver defects. We chose these three classes to test the robustness of ISU-GAN for diverse textures, messy textures, and sliver defects, respectively. The defect images for the three classes used are shown in Fig. 4.

Evaluation metrics.
In the comparison experiments in this paper, we use Accuracy (Acc) and F1-score to compare the defect detection effectiveness of the different models. In the ablation studies, we use F1-score and MSE to examine the impact of different modules on network performance.
Here we define TN: predicted defective sample and actually defective sample; FN: predicted defective sample but actually non-defective sample; TP: predicted non-defective sample and actually non-defective sample; FP: predicted non-defective sample but actually defective sample.
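Given the four counts defined above, Accuracy and F1-score follow directly. The sketch below assumes the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN), applied to the paper's convention in which a "positive" is a non-defective sample; the function names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Percentage of all samples that were predicted correctly."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example with hypothetical counts: 90 TP, 95 TN, 5 FP, 10 FN.
print(accuracy(90, 95, 5, 10))   # 92.5
print(f1_score(90, 5, 10))
```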
Accuracy. Accuracy is defined as the proportion of all samples that are predicted correctly, as in Eq. (12).
$$\mathrm{Accuracy} = \frac{TN + TP}{TN + FN + TP + FP} \times 100\% \qquad (12)$$
F1-score. The F1-score is a statistical measure of the accuracy of a binary classification model, defined as the harmonic mean of Precision and Recall, as in Eq. (13).
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (13)$$

MSE.
In our ablation studies, we use Mean Square Error (MSE) to measure the similarity between the pseudo-positive samples restored by the defect repair generator and the original positive samples. A lower value indicates that the reconstructed image is closer to the original one in detail. We do not use negative samples when calculating the MSE because the better the repair is for the defective region, the higher the MSE will be. For this paper, the MSE is calculated as the average over all positive samples.
To improve the convergence of the model, we resize the input image from 512 × 512 to 256 × 256 using bicubic interpolation 34 . To improve the robustness of the model, the batch size is set to 1, and each input image undergoes, with equal probability, one of the following three operations: (1) keeping it unchanged, (2) flipping it horizontally, and (3) flipping it vertically. Our network was trained from scratch for all experiments, using the Adam optimizer 35 with an initial learning rate of 0.0002 for 100 training epochs. In the comparison experiments section, we compare the performance of ISU-GAN with commonly used defect segmentation models (UNet, ResUNet++) and classic GAN networks (original CycleGAN, DCGAN) for defect detection and segmentation. In the ablation studies section, we compare the impact of each ISU-GAN module on the network performance.
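The three-way augmentation can be sketched as below. This is an illustrative NumPy version that assumes a single greyscale image stored as an (H, W) array; the resizing step is left out because it depends on the image library used.

```python
import numpy as np

def random_flip(img, rng):
    """Keep the image, flip it horizontally, or flip it vertically (1/3 each)."""
    choice = rng.integers(3)
    if choice == 1:
        return img[:, ::-1]      # horizontal flip (mirror left-right)
    if choice == 2:
        return img[::-1, :]      # vertical flip (mirror top-bottom)
    return img                   # unchanged

# Usage: rng = np.random.default_rng(); augmented = random_flip(image, rng)
```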
Comparison experiments. In this section, we compare the defect detection and segmentation performance of our ISU-GAN with several other models. The models used for comparison include the classical GAN networks CycleGAN and DCGAN, the commonly used semantic segmentation model UNet, and its improved version ResUNet++. UNet is one of the classical models of semantic segmentation, often used as a benchmark model for various segmentation tasks, and it is also widely used in the field of defect detection 17,18 . ResUNet++ is a relatively new member of the UNet family, which combines the advantages of ResNet and UNet and introduces SE blocks to achieve more powerful image segmentation capabilities. As mentioned in the related work, CycleGAN 23 and DCGAN 26 have been applied to the DAGM dataset with good results, so we chose these GANs for comparison. The experimental results in the test stage are shown in Fig. 5 and Table 2.
From the experimental results, it can be seen that despite using less than one-third of the training data of the other models and no labels, ISU-GAN still shows an improvement of more than 2.5% in the average of both metrics compared to UNet. ResUNet++, an improved version of UNet, performs markedly better than UNet in all categories, but its Acc and F1 are lower than those of ISU-GAN by about 0.4% and 0.1%. Compared with the detection results of CycleGAN and DCGAN, ISU-GAN improves significantly in all categories, with average improvements of over 1.5% and 3.0%, respectively. The comparison of the test results of each model verifies that our method is effective.
It is worth mentioning that ISU-GAN performs significantly worse than ResUNet++ on Class 1, which is also the class on which ISU-GAN scores lowest. A possible reason is that the wide variety of background textures in Class 1 makes it harder for our model to find the expected mapping between positive and negative samples.
As can be seen from Fig. 5, even without using labels during training, our model segments defects more finely and accurately than the supervised UNet and ResUNet++, which helps workers in the manufacturing industry determine the type of defect. Although it is also trained without supervision, the DCGAN method needs defects to be created manually for the images, which is more tedious, whereas our method omits this procedure and achieves significantly better results. We also compare the defect repair results of ISU-GAN and CycleGAN, see Fig. 6. It can be observed that the repair map generated by ISU-GAN is closer to the original image in detail; in particular, the texture at the edges is smoother and more realistic.
Ablation studies. Ablation studies were set up to investigate the impact of three crucial modules (skip connection, Involution, SE) in the generator structure of ISU-GAN on the effectiveness of defect detection. The Generator models compared in the ablation experiment are: 1. the original CycleGAN (default); 2. using only one of the three modules; 3. using all three modules (ISU-GAN).
The dataset and hyperparameters used for the ablation experiments are the same as in the comparison experiments, and all submodels use the method proposed in the Methodology section to detect defects. The results of the experiments are shown in Table 3.
On average, the improvement of skip connection for the model lies mainly in the significant reduction of MSE, but the improvement of F1-score is not apparent. In contrast, the Involution Block improves the F1-score significantly but also increases the MSE noticeably, while the SE Block optimizes both values to a lesser extent. For the ISU-GAN with all three modules, we can see that it achieves the best results in both average values, and the improvement is significant compared to the original CycleGAN. It indicates that the ISU-GAN model structure is reasonable and practical.

Conclusion
From the results of this paper, our proposed defect detection model ISU-GAN and the associated defect extraction method can perform well under unsupervised conditions with a small number of training samples. ISU-GAN innovatively uses skip connection, SE block and Involution Block in the generator to obtain better defect feature characterization. Furthermore, the SSIM-based defect extraction method can extract more accurate defect profiles.
Through comparison experiments, we show that ISU-GAN can achieve a better defect detection effect even if the training conditions are much weaker than those of UNet and ResUNet++. Through ablation studies, we show the impact of the three main modules of ISU-GAN on the network performance and verify the effectiveness of the ISU-GAN structure.
In the comparison experiments, we noted that ISU-GAN performs significantly worse on Class 1 than on the other classes, because it is difficult to map between positive and negative samples in datasets with richer texture types. To address this problem, we will further optimize the network structure in subsequent work to obtain more robust performance.

Data availability
Datasets used in this study are available to download at: https://hci.iwr.uni-heidelberg.de/node/3616.